Average Cost Temporal-Difference Learning

Authors

  • John N. Tsitsiklis
  • Benjamin Van Roy
Abstract

We propose a variant of temporal-difference learning that approximates the average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence with probability 1 and a characterization of the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the mixing time of the Markov chain. The results parallel previous work by the authors involving approximations of discounted cost-to-go.
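The update rule described in the abstract can be sketched in a few lines: a running estimate of the average cost and a weight vector for the differential-cost approximation, both updated incrementally along one trajectory. The sketch below is a minimal illustration, not the paper's exact algorithm: the three-state chain, the costs, the constant step size, and the one-hot (tabular) basis functions are all hand-chosen assumptions for demonstration.

```python
import random

# Irreducible, aperiodic 3-state Markov chain (doubly stochastic, so the
# stationary distribution is uniform and the true average cost is mean(g) = 2.0).
# All numbers here are illustrative, not from the paper.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
g = [1.0, 2.0, 3.0]           # per-state costs

def phi(x):
    """One-hot features: a tabular special case of fixed basis functions."""
    f = [0.0, 0.0, 0.0]
    f[x] = 1.0
    return f

def avg_cost_td(steps=200_000, lam=0.5, step=0.01, seed=0):
    rng = random.Random(seed)
    mu = 0.0                  # running estimate of the average cost
    r = [0.0, 0.0, 0.0]       # weights of the differential-cost approximation
    z = [0.0, 0.0, 0.0]       # eligibility trace
    x = 0
    for _ in range(steps):
        y = rng.choices((0, 1, 2), weights=P[x])[0]
        # Temporal difference for the average-cost formulation; with one-hot
        # features, phi(x).r is just r[x].
        d = g[x] - mu + r[y] - r[x]
        z = [lam * zi + fi for zi, fi in zip(z, phi(x))]
        r = [ri + step * d * zi for ri, zi in zip(r, z)]
        mu += step * (g[x] - mu)  # track the average cost (a constant step
                                  # size here; the paper uses diminishing ones)
        x = y
    return mu, r

mu, r = avg_cost_td()
# mu should settle close to 2.0, the true average cost of this chain.
```

Note that the differential cost function is only determined up to an additive constant, so only the differences r[i] - r[j] are meaningful; the average-cost estimate mu is unaffected by this.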


Similar articles

On Average versus Discounted Reward Temporal-Difference Learning

We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further ar...
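The limiting relationship described in this snippet is the standard Laurent-expansion connection between discounted and average-cost values. Writing $\mu$ for the average cost and $h$ for the differential value function (symbols chosen here for illustration, not taken from the truncated abstract), for a finite irreducible aperiodic chain the discounted value function $V_\alpha$ satisfies

$$
V_\alpha(x) \;=\; \frac{\mu}{1-\alpha} \;+\; h(x) \;+\; o(1)
\qquad \text{as } \alpha \to 1,
$$

so after subtracting the state-independent term $\mu/(1-\alpha)$, the discounted values converge to the differential values.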

Full text

Learning to Achieve Goals

Temporal difference methods solve the temporal credit assignment problem for reinforcement learning. An important subproblem of general reinforcement learning is learning to achieve dynamic goals. Although existing temporal difference methods, such as Q learning, can be applied to this problem, they do not take advantage of its special structure. This paper presents the DG-learning algorithm, whi...

Full text

Analytical Mean Squared Error Curves in Temporal Difference Learning

We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its step-...

Full text

Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning

Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor. Currently the most important application of these methods is to temporal credit assignment in reinforcement learning. Well known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This p...

Full text


Journal:

Volume   Issue 

Pages  -

Publication date: 1999